Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Dec 17, 2025

RestartAsync_EndToEnd(restartWithNewInstanceId: True) was failing intermittently due to a TOCTOU race condition in WaitForInstanceAsync.

Problem

The original code checked completion status before adding a waiter. If orchestration completed between these two steps, the completion notification was sent to no registered waiter, causing the caller to wait indefinitely (30s timeout).

Fix

Reorder operations to add waiter first, then check status:

// Before: check then register (race window exists)
if (IsCompleted(state)) return state;
var tcs = waiters.GetOrAdd(instanceId, ...);

// After: register then check (no race window)
var tcs = waiters.GetOrAdd(instanceId, ...);
if (IsCompleted(state)) tcs.TrySetResult(state);

This ensures we either:

  • Catch already-completed orchestrations in the status check, or
  • Receive the notification if completion happens after registration
Original prompt

investigate and fix https://github.com/microsoft/durabletask-dotnet/actions/runs/20318223840/job/58367426650?pr=572 the failing testcase Microsoft.DurableTask.Grpc.Tests.DurableTaskGrpcClientIntegrationTests.RestartAsync_EndToEnd(restartWithNewInstanceId: True) [FAIL]


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

…End test to fail

The original code had a TOCTOU race condition where a completion notification
could be missed if the orchestration completed between checking completion
status and adding the waiter. The fix reorders the operations to add the
waiter first, then check for completion.

Co-authored-by: YunchuWang <[email protected]>
Copilot AI changed the title [WIP] Investigate and fix failing testcase in DurableTaskGrpcClientIntegrationTests Fix race condition in WaitForInstanceAsync causing intermittent test failures Dec 17, 2025
Copilot AI requested a review from YunchuWang December 17, 2025 22:37
@YunchuWang YunchuWang marked this pull request as ready for review December 17, 2025 22:53
Copilot AI review requested due to automatic review settings December 17, 2025 22:53
@YunchuWang
Copy link
Member

LGTM

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an intermittent race condition in WaitForInstanceAsync that was causing the RestartAsync_EndToEnd test to fail. The issue occurred when an orchestration completed between checking its status and registering a waiter, resulting in missed completion notifications and 30-second timeouts.

Key Changes:

  • Reordered operations in WaitForInstanceAsync to register the waiter before checking completion status, eliminating the race condition window
  • Added clear comments explaining the race condition fix and the two-step registration-then-check pattern

@YunchuWang
Copy link
Member

@copilot update pr description to follow .github/PULL_REQUEST_TEMPLATE.md

@YunchuWang YunchuWang merged commit 27ebd4b into main Dec 17, 2025
8 of 9 checks passed
@YunchuWang YunchuWang deleted the copilot/fix-failing-integration-test branch December 17, 2025 23:42
Copilot AI requested a review from YunchuWang December 17, 2025 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants